Red Wine Quality Exploration by Yiyi Tang

Goal of this project:
Using R and exploratory data analysis techniques, I explore relationships
within the red wine quality dataset, in particular how chemical properties
influence the quality of red wine.

Background Information of the dataset

The RedWineQuality dataset contains 1599 red wine observations of 12 variables
(11 chemical properties plus quality), along with a row index X. The output
variable ‘quality’ (based on sensory data) was scored between 0 (very bad) and
10 (very excellent).

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

All variables are numeric except X (the row index) and quality, which are integers.

Univariate Plots Section

Quality

Before exploring further, I looked at the summary of the ‘quality’ variable:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

As shown above, wine quality ranges from 3 to 8 in this dataset. 82.5% of wines
were scored 5 or 6.

Quality groups

## quality_bucket
##       Bad (Rating 3 - 4)       Normal (Rating 5)           Good (Rating 6) 
##                       63                      681                      638 
## Excellent (Rating 7 - 8) 
##                      217

I created quality_bucket to group quality ratings. Wines scoring 3 or 4 are
grouped in the “Bad” bucket, wines scoring 5 in the “Normal” bucket, wines
scoring 6 in the “Good” bucket, and wines scoring 7 or 8 in the “Excellent”
bucket.
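The bucketing can be sketched with base R's cut(). This is only an illustration: the report does not show how quality_bucket was actually built, and the quality vector here is rebuilt from the rating counts printed earlier rather than read from the data file.

```r
# Rebuild the quality column from the printed counts
# (10, 53, 681, 638, 199 and 18 wines for ratings 3 to 8).
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# cut() uses right-closed intervals by default: (2,4], (4,5], (5,6], (6,8]
quality_bucket <- cut(
  quality,
  breaks = c(2, 4, 5, 6, 8),
  labels = c("Bad (Rating 3 - 4)", "Normal (Rating 5)",
             "Good (Rating 6)", "Excellent (Rating 7 - 8)")
)

bucket_counts <- table(quality_bucket)
print(bucket_counts)

# Share of wines rated 5 or 6, in percent
share_5_or_6 <- 100 * mean(quality %in% c(5, 6))
print(round(share_5_or_6, 1))
```

The bucket sizes reproduce the table above, and the share of wines rated 5 or 6 follows from the same counts.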

Distributions of each chemical variable

Let’s look at each chemical variable’s distributions:

Most distributions are positively skewed. The distributions of fixed.acidity,
density and pH are roughly symmetric.

The box plots above clearly show outliers.

Remove Outliers

Following the usual descriptive-statistics convention, I defined outliers as data
points more than 1.5 times the interquartile range (IQR) above the upper quartile
(Q3) or below the lower quartile (Q1). Since most distributions are positively
skewed, most outliers lie on the high side, so I decided to remove only values
greater than Q3 + 1.5 * IQR.
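A minimal sketch of this rule, assuming a helper named remove_upper_outliers (a hypothetical name, not taken from the original analysis) and a toy vector in place of the wine data:

```r
# Hypothetical helper: keep only values at or below the upper fence Q3 + 1.5 * IQR.
remove_upper_outliers <- function(x) {
  upper_fence <- quantile(x, 0.75, names = FALSE) + 1.5 * IQR(x)
  x[x <= upper_fence]
}

# Toy vector with one obvious high outlier
toy <- c(1:10, 100)
kept <- remove_upper_outliers(toy)
print(kept)

# Upper fence for alcohol, using the quartiles from the summary table
# (Q1 = 9.5, Q3 = 11.1)
alcohol_fence <- 11.1 + 1.5 * (11.1 - 9.5)
print(alcohol_fence)
```

For alcohol the fence works out to 13.5, which is consistent with the maximum of 13.40 reported after outlier removal.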

Let’s see the distributions of each variable after removing outliers:

After removing the outliers, the distributions of citric.acid, free.sulfur.dioxide,
total.sulfur.dioxide and alcohol remain slightly positively skewed. But the
boxplots show that most of the outliers have been removed. Overall, the
distributions are now closer to symmetric.

Distribution of alcohol after removing outliers

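# Lower and upper quartiles of alcohol (25th and 75th percentiles)
alcohol_lowerq <- quantile(wine$alcohol, 0.25)
alcohol_upperq <- quantile(wine$alcohol, 0.75)

# Fences: 1.5 * IQR beyond the quartiles
alcohol_upper <- alcohol_upperq + 1.5 * IQR(wine$alcohol)
alcohol_lower <- alcohol_lowerq - 1.5 * IQR(wine$alcohol)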
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.10   10.39   11.00   13.40

After removing outliers, the mean of alcohol is 10.39. There is a peak around 9.3.

Let’s look at the alcohol distribution for each quality score:

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.580  11.000 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

Wines of higher quality tend to have a higher mean alcohol content, although
quality 5 has the lowest mean of all (9.90).

## # A tibble: 6 x 4
##   quality alco_mean alco_median     n
##     <int>     <dbl>       <dbl> <int>
## 1       3  9.955000       9.925    10
## 2       4 10.265094      10.000    53
## 3       5  9.899706       9.700   681
## 4       6 10.629519      10.500   638
## 5       7 11.465913      11.500   199
## 6       8 12.094444      12.150    18

I built a summary table, ‘wine.alco_by_quality’, describing alcohol grouped by
quality. Wines with a quality score of 8 have the highest mean alcohol (around
12.09%) and the highest median (around 12.15%). But this does not yet
prove any linear relationship or correlation.
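A per-quality summary of this shape can be sketched in base R. The tibble output above suggests the original used dplyr, so this is an alternative under that assumption, run on a small made-up data frame rather than the wine data:

```r
# Toy stand-in for the wine data (values are made up for illustration)
toy <- data.frame(
  quality = c(5, 5, 6, 6, 6, 7),
  alcohol = c(9.4, 9.8, 10.0, 10.5, 11.0, 12.0)
)

# Split alcohol by quality, then summarise each group
groups <- split(toy$alcohol, toy$quality)
alco_by_quality <- data.frame(
  quality     = as.integer(names(groups)),
  alco_mean   = unname(sapply(groups, mean)),
  alco_median = unname(sapply(groups, median)),
  n           = unname(sapply(groups, length))
)
print(alco_by_quality)
```

Each row holds the mean, median and count for one quality score, matching the columns of the table above.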

Having looked at the mean and median of alcohol in each quality category, I’m
curious to find out whether alcohol influences the quality of wine, and whether
other variables together with alcohol influence the quality.

Univariate Analysis

What is the structure of your dataset?

There are 1599 wine observations in the dataset with 12 features
(fixed acidity, volatile acidity, citric acid, residual sugar, chlorides,
free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol and
quality). The output variable quality is based on sensory data, scored between
0 and 10.

I set the ‘quality’ variable as an ordered factor. Its levels are shown
below:

(very bad) ---> (very excellent)

quality: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

In this dataset, however, quality ranges only between 3 and 8. I grouped quality
into buckets: Bad bucket (rating 3-4), Normal bucket (rating 5), Good bucket (rating 6),
and Excellent bucket (rating 7-8).

Other observations:

  • Most distributions tend to be positively skewed. I noticed that the distributions
    of fixed.acidity, density and pH are symmetric.
  • I found some outliers in the dataset, and then plotted each variable’s distribution
    before and after removing these outliers.
  • Most wines in the dataset have a quality of 5 or 6.
  • The mean alcohol is 10.42%, and the median alcohol is 10.20%. After removing outliers,
    the mean alcohol is 10.39%.
  • The min quality of wine in the dataset is 3, the max quality is 8,
    and the mean quality is 5.636.
  • About 75% of wines contain at most 2.6 g / dm^3 residual.sugar.
  • The mean citric.acid is 0.271 g / dm^3, and the max citric.acid is 1 g / dm^3.

What is/are the main feature(s) of interest in your dataset?

The main features of interest in my dataset are quality, alcohol, density and
citric.acid. I’d like to know which feature or combination of features is best
for predicting the quality of wine.

I suspect alcohol or citric.acid and some combination of the other variables
can influence the quality of wine. This hypothesis may help me build a
predictive model for wine quality in the following analysis.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Features like pH and density will help support my investigation because I suspect
alcohol might influence the density of wine, and pH might be influenced
by alcohol and citric.acid. However, I will decide on the final list of features
to explore in the next section, where a correlation matrix plot will let me
choose features of interest with more confidence.

Did you create any new variables/tables from existing variables in the dataset?

I created a summary table named ‘wine.alco_by_quality’ to better see whether
alcohol and quality are correlated.

Also, I created quality_bucket to group quality ratings: “Bad” (ratings 3 and 4),
“Normal” (rating 5), “Good” (rating 6), and “Excellent” (ratings 7 and 8).

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

  • I found some outliers in the dataset, and plotted box plots for each variable
    to see where these outliers lie. For example, alcohol has outliers above
    13.5, the upper fence Q3 + 1.5 * IQR = 11.1 + 1.5 * 1.6. Also, I noticed that
    the best quality category (8) has the highest mean (12.09) and median (12.15)
    of alcohol. But that alone does not establish a linear relationship or
    correlation between alcohol and quality. I will analyze them further in the
    following section.

  • The citric.acid distribution has several peaks and is slightly skewed to the
    right. The highest peak is at 0.00, and there are 3 other relatively small
    peaks in the distribution. I also noticed an outlier at 1.00. I
    checked the wine with 1.00 citric.acid and found it has quality 4.

Bivariate Plots Section

Instead of randomly plotting potential relationships among these variables,
I used a correlation matrix plot to find meaningful correlations between variables.
I’m interested in correlations between -1 and -0.2, or between 0.2 and 1. These
ranges represent moderate or strong correlations between the given variables.
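The same screening can be done numerically. The sketch below assumes a made-up data frame (columns a, b and c are placeholders, not wine variables) and uses base R's cor(); it is not the plotting code used in this report.

```r
set.seed(1)
n <- 200
a <- rnorm(n)
toy <- data.frame(
  a = a,
  b = a + rnorm(n, sd = 0.5),  # built to be strongly related to a
  c = rnorm(n)                 # independent noise
)

# Pearson correlation matrix (the default method of cor())
corr <- cor(toy)

# Keep each pair once (upper triangle) and flag those with |r| >= 0.2
idx <- which(upper.tri(corr) & abs(corr) >= 0.2, arr.ind = TRUE)
notable <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r    = corr[idx]
)
print(notable)
```

Listing the pairs this way makes the 0.2 cutoff explicit instead of reading it off a color scale.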

First, I noticed that 4 variables have meaningful correlations with quality:
alcohol, sulphates, citric.acid and volatile.acidity. Second, I found that the
correlations between density and fixed.acidity, and between pH and fixed.acidity,
are moderate. I’m also interested in exploring the relationship between alcohol and density.

To sum up, I’m going to explore the relationships between:

  • alcohol and quality
  • alcohol and density
  • citric.acid and quality
  • density and fixed.acidity

Furthermore, I assume that more than one variable influences the quality
of red wines. Therefore, I will test how quality relates to different combinations
of alcohol, sulphates, citric.acid and volatile.acidity.

In addition, I noticed 2 strong but uninformative correlations: citric.acid
with fixed.acidity, and total.sulfur.dioxide with free.sulfur.dioxide. These correlations
are strong because one variable is a component of the other.

Relationship between alcohol and quality

I removed outliers in alcohol to see whether the relationship between alcohol and
quality would become stronger. It turned out only slightly stronger, so I
use Pearson’s correlation on the full data to quantify the relationship. There
may also be other variables involved in this relationship.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$alcohol and wine$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

The Pearson correlation above shows a moderate positive correlation between
alcohol and quality. To be more specific, wines with higher alcohol tend to be
of better quality.
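As a sanity check, the t statistic in the output above can be recomputed from the correlation and the degrees of freedom alone, using the standard relation t = r * sqrt(df / (1 - r^2)). The numbers are copied from the printed output; nothing is read from the data.

```r
r  <- 0.4761663   # sample correlation from the printed output
n  <- 1599        # number of wines
df <- n - 2       # degrees of freedom for Pearson's test

# Test statistic for H0: true correlation equals 0
t_stat <- r * sqrt(df / (1 - r^2))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)

print(c(t = t_stat, df = df))
print(p_value)
```

The recomputed statistic agrees with the reported t = 21.639 on 1597 degrees of freedom.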

Relationship between alcohol and density

## 
##  Pearson's product-moment correlation
## 
## data:  wine$alcohol and wine$density
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5322547 -0.4583061
## sample estimates:
##        cor 
## -0.4961798

There’s a moderate negative correlation between alcohol and density. To be
specific, wines with higher alcohol tend to have lower density.

Relationship between citric.acid and quality

## 
##  Pearson's product-moment correlation
## 
## data:  wine$citric.acid and wine$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725

There’s a small positive correlation between citric.acid and quality.

## quality_bucket: Bad (Rating 3 - 4)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0200  0.0800  0.1737  0.2700  1.0000 
## -------------------------------------------------------- 
## quality_bucket: Normal (Rating 5) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2300  0.2437  0.3600  0.7900 
## -------------------------------------------------------- 
## quality_bucket: Good (Rating 6)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2600  0.2738  0.4300  0.7800 
## -------------------------------------------------------- 
## quality_bucket: Excellent (Rating 7 - 8)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3000  0.4000  0.3765  0.4900  0.7600

Better quality wines have a higher mean citric.acid. Excellent wines have the highest
citric.acid mean (around 0.3765 g / dm^3) and the highest citric.acid median (around
0.4 g / dm^3).

Although citric.acid can add ‘freshness’ or flavor to wine, the correlation
between quality and citric.acid is weak. Still, there’s a tendency for better
quality wine to have higher mean citric.acid.

Relationship between density and fixed.acidity

## 
##  Pearson's product-moment correlation
## 
## data:  wine$fixed.acidity and wine$density
## t = 35.877, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6399847 0.6943302
## sample estimates:
##       cor 
## 0.6680473

fixed.acidity and density have a moderate positive correlation.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

  • I used a correlation matrix plot to find moderate or strong correlations
    between variables. 4 variables have meaningful correlations with quality:
    alcohol, sulphates, citric.acid and volatile.acidity. I explored the correlations
    between alcohol and quality, citric.acid and quality, density and fixed.acidity,
    and alcohol and density.

  • Alcohol and quality have a moderate correlation (around 0.476): wines with
    higher alcohol tend to be of better quality.
    Fixed.acidity and density have a moderate positive correlation.

  • The correlation between quality and citric.acid is weak (around 0.226). But I
    found that better quality wine has higher mean citric.acid. For example,
    Excellent wines have the highest citric.acid mean (around 0.3765 g / dm^3) and
    the highest citric.acid median (around 0.4 g / dm^3).

  • There’s a moderate negative correlation between alcohol and density (around
    -0.496): wines with higher alcohol tend to have lower density.

What was the strongest relationship you found?

Among the pairs I tested, fixed.acidity and density have the strongest correlation, around 0.67.

Multivariate Plots Section

Alcohol and density in quality category

All four quality groups follow the negative relationship between density and
alcohol.

Citric.acid and pH in quality category

Quality groups follow the relationship between pH and citric.acid. The low quality
group (ratings 3-4) has a relatively wider range of citric.acid. Also, I noticed that many
medium quality wines (ratings 5-6) have 0 citric.acid, compared with the low and
high quality (ratings 7-8) groups.

Fixed.acidity and density in quality category

Quality groups all follow the positive correlation between fixed.acidity and density.
As fixed.acidity increases, the density of wine tends to increase.

Calculate r-squared value

By calculating the r-squared value, I want to test whether alcohol, the variable
most strongly correlated with quality, yields an r-squared value high enough to
support a linear relationship with quality.

m1 <- lm(quality ~ alcohol, data = wine)

summary(m1)$r.squared
## [1] 0.2267344

I chose alcohol (which has the strongest correlation with quality among the variables)
to test the linear relationship with quality. Unfortunately, the r-squared is not
strong (0.22673).
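This value is not independent of the correlation reported earlier: for a linear model with a single predictor, r-squared equals the square of Pearson's r. The sketch below checks that identity on simulated data (x and y are made up) and then on the two numbers printed in this report.

```r
set.seed(42)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)

fit <- lm(y ~ x)
r_squared <- summary(fit)$r.squared
pearson_r <- cor(x, y)

# With one predictor the two quantities coincide
print(c(r_squared = r_squared, r_times_r = pearson_r^2))

# Same check with the values reported for alcohol and quality
reported_r <- 0.4761663
print(reported_r^2)
```

Squaring the reported correlation of 0.476 gives about 0.2267, the r-squared shown above.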

I decided to add more variables to this model to see whether the r-squared value
would improve. A clear increase would suggest that the added variables help
explain quality beyond what alcohol explains alone.

From the earlier correlation matrix plot, I noticed that besides alcohol,
3 variables have meaningful correlations with quality: sulphates,
citric.acid and volatile.acidity. Let’s add variables to the model step by step.

# m1: alcohol only; m2: adds sulphates
m1 <- lm(quality ~ alcohol, data = wine)
m2 <- lm(quality ~ alcohol + sulphates, data = wine)
# m3 and m4 use density instead of sulphates, so they do not extend m2
m3 <- lm(quality ~ alcohol + density + citric.acid, data = wine)
m4 <- lm(quality ~ alcohol + density + citric.acid + volatile.acidity, data = wine)

summary(m1)$r.squared
## [1] 0.2267344
summary(m2)$r.squared
## [1] 0.2698912
summary(m3)$r.squared
## [1] 0.2576685
summary(m4)$r.squared
## [1] 0.3189737

Across these models, the r-squared value improved from 0.22673 (m1) to 0.32 (m4),
though not at every step: m3 (0.258) scores below m2 (0.270) because it replaces
sulphates with density and citric.acid rather than extending m2. The improvement
suggests that several variables together are associated with the overall quality
of wines. But the final r-squared value is still
not strong enough (I consider an r-squared value over 0.5 as strong enough).
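One caveat when comparing models this way: when each model only adds predictors to the previous one, r-squared can never go down, so a small rise is not evidence by itself. Adjusted r-squared penalises extra predictors and is a common companion measure. A minimal sketch on simulated data (x1, x2, x3 and y are made up; x3 is pure noise):

```r
set.seed(7)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 0.6 * toy$x1 + 0.3 * toy$x2 + rnorm(n)

# Three nested models: each adds one predictor to the previous one
formulas <- list(y ~ x1, y ~ x1 + x2, y ~ x1 + x2 + x3)
fits <- lapply(formulas, lm, data = toy)

r2     <- sapply(fits, function(f) summary(f)$r.squared)
adj_r2 <- sapply(fits, function(f) summary(f)$adj.r.squared)

print(data.frame(model = c("m1", "m2", "m3"), r2 = r2, adj_r2 = adj_r2))
```

A drop in plain r-squared between two models, like the one from m2 to m3 above, therefore signals that the second model is not an extension of the first.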

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

  • From the bivariate analysis I found that density and alcohol have a moderate
    negative correlation. From the multivariate analysis, after adding quality groups
    to the plot, I found that the quality groups follow the relationship between
    density and alcohol.

  • Quality groups all follow the positive correlation between fixed.acidity and
    density. As fixed.acidity increases, the density of wine tends to increase.

  • Among my variables of interest, alcohol has the strongest
    relationship with quality, so I calculated its r-squared value. Although that
    r-squared value is not strong (around 0.22673), it improved
    from 0.22673 to 0.32 as I added further variables (sulphates in m2; density,
    citric.acid and volatile.acidity in m3 and m4).

Were there any interesting or surprising interactions between features?

  • Based on the Pearson correlation value, I expected the r-squared value
    between alcohol and quality to be strong, at least bigger than 0.5. But my
    expectation was wrong: with one predictor, r-squared is the square of the
    correlation, about 0.227. The r-squared value did rise overall as I added
    variables, from 0.227 to 0.32, although m3 (0.258) came out below m2 (0.270).

  • It also surprised me that the quality groups all follow the meaningful relationships
    I found in the bivariate analysis. To be specific, quality groups follow the
    relationships between alcohol and density, fixed.acidity and density, and pH and citric.acid.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

I created a linear model. I used quality as the dependent variable and alcohol as
the independent variable. After finding that the r-squared value was not strong enough,
I added further independent variables step by step (sulphates, density, citric.acid
and volatile.acidity). The resulting r-squared value improved to 0.32, but is still not
strong enough.

The approach reports the r-squared value each time a new variable is added,
so it’s easy to see how much of the variation in quality a linear fit explains.
But the model has limitations. I didn’t include all the variables in the
dataset, so there may still be variables that
influence the quality of wines but are missing from the model.


Final Plots and Summary

Plot One

Description One

Alcohol and density have a moderate negative correlation, around -0.496. Wines with
higher alcohol percentage by volume tend to have lower density (g / cm^3), and
all wine quality groups follow this relationship between alcohol and density.

Plot Two

Description Two

Alcohol has the strongest correlation with quality, around 0.476. Wines with higher
alcohol percentage by volume tend to be of better quality. But I did notice that
wines with a quality score of 5 fall a bit off the trend. This might be because
other variables that I didn’t discuss influence quality together with
alcohol.

Plot Three

Description Three

Fixed.acidity (tartaric acid - g / dm^3) and density (g / cm^3) have a moderate
positive correlation, around 0.67. Wines with higher fixed.acidity tend to have higher
density, and all wine quality groups follow this relationship between fixed.acidity
and density.

Reflection

This Red Wine Quality dataset contains 1,599 observations of red wines. There are
12 variables in the dataset: 11 chemical properties of
these wines, and 1 output variable, wine quality, which was graded by experts
between 0 (very bad) and 10 (very excellent).

I’m interested in exploring how these chemical properties influence the quality
of wine. I plotted a correlation matrix to decide which variables to explore
further. Through univariate, bivariate and multivariate analysis and statistical
tests, I examined the meaningful relationships between these variables.

Among the variables included in the dataset, alcohol had the strongest correlation
with wine quality, around 0.476. Wines with higher alcohol
percentage by volume tend to be of better quality. Unfortunately, the calculated
r-squared value between alcohol and quality is not strong (around 0.22673).

I decided to add other variables to the model, guided by the correlation
matrix plot: sulphates first, then density, citric.acid and volatile.acidity.
The r-squared value improved from 0.22673
to 0.32.

I think the limitations of this dataset are one of the major challenges.
Among the 1,599 observations, 82.5% of wines received a score of 5 or 6.
Around 4% of wines received a score of 3 or 4, and 13.6% received a score
of 7 or 8. It would be better to have a wider spread of quality scores in the
dataset.

For future analysis, it would be interesting and meaningful to combine or
compare this dataset with the white wine dataset, to see how the correlations
between these chemical properties and quality differ.
